Interpreting random forest analysis of ecological models to move from prediction to explanation

As modeling tools and approaches become more advanced, ecological models are becoming more complex. Traditional sensitivity analyses can struggle to identify the nonlinearities and interactions emergent from such complexity, especially across broad swaths of parameter space. This limits understanding of the ecological mechanisms underlying model behavior. Machine learning approaches are a potential answer to this issue, given their predictive ability when applied to complex large datasets. While perceptions that machine learning is a “black box” linger, we seek to illuminate its interpretive potential in ecological modeling. To do so, we detail our process of applying random forests to complex model dynamics to produce both high predictive accuracy and elucidate the ecological mechanisms driving our predictions. Specifically, we employ an empirically rooted ontogenetically stage-structured consumer-resource simulation model. Using simulation parameters as feature inputs and simulation output as dependent variables in our random forests, we extended feature analyses into a simple graphical analysis from which we reduced model behavior to three core ecological mechanisms. These ecological mechanisms reveal the complex interactions between internal plant demography and trophic allocation driving community dynamics while preserving the predictive accuracy achieved by our random forests.


S1 -Model development
In this section we use the online database COMPADRE to investigate model structure, plausible 17 parameter values, and simulation design. R scripts to run our analysis are presented in SI github. The 18 analysis presented below was completed in the later part of 2021 and COMPADRE is a growing database. 19 Therefore, attempts to recreate our analysis at a later date are likely to reflect changes made to the 20 database. 21 22 S1.1 -Model structure 23 24 Three distinct stages were used as the ontogenetic structure of the plant population in our model: 25 fecund adults ( ), non-fecund seedlings ( 2 ), and the seed bank ( 1 ) (Fig 1). While no single ontogenetic 26 structure can perfectly represent all plant species, three stages is a suitable baseline for the purposes of 27 this study. First, a three-stage structure creates a tractable dimensionality in our model which allows for 28 an in depth analysis of interdependent dynamic effects born out of interacting demographic/ecological 29 rates. Importantly, this allows us to more clearly graphically represent the consequences and details of 30 interacting rates which would be less apparent in higher dimensional ontogenetic formulations. Second, 31 using the growing global database of stage-structured plant demographic data, COMPADRE 42 , we can see 32 that the three stage ontogeny is well-represented in plant taxa empirically. Of all the documented plant 33 taxa within the COMPADRE database, three-stage plant structures represent roughly 34% of families, 34 18% of genera, and 14% of species. This includes abundant, species-rich, economically important, and 35 geographically wide-spread plant families such as Asteraceae, Brassicaceae, Orchidaceae, Rosaceae, etc. 36 The plant population in the model experiences density dependent restrictions on the production of 37 seeds via the maximum function which restricts seed production to a minimum of zero in relation to the 38 density of F. Stage transitions are also modified by density dependent pressure with limited density 39 dependent effects from "younger" via the parameter (see Eq1 & Table 1). Finally, consumer pressure 40 from the herbivore population ( ) can differentially focus its herbivory on either the seedling stage ( 2 > 41 0, = 0), the fecund adult stage ( 2 = 0, > 0), or both ( 2 > 0, > 0 The COMPADRE database also allows for further insight into our model. We can use the 57 available data to inform certain model parameters. The COMPADRE reproductive and stage transition 58 rates mirror our demographic parameters. Production of stage 1 individuals from stage 3 corresponds to 59 seed production ( ). Transition from stage 1 to 2 corresponds to seed germination ( 12 ) and transition 60 from stage 2 to 3 corresponds to seedling maturation ( 2 ). However, these rates represent data from a 61 range of different studies using a range of experimental, natural, and agricultural plant demographic data.

62
Additionally, these studies involve transition rates which emerge from both competition and trophic 63 interactions. Therefore, measurable field rates are frequently emergent from environmental conditions and 64 are not necessarily the inherent demographic rates. As such, we do not intend to use these empirical rates 65 to determine exact values of our model's demographic parameters. Instead, we use the ranges of each 66 empirical rate to inform plausible ranges for their corresponding model parameter to be analyzed in 67 parameter sweeps of model simulations.

68
In doing so, we limited our survey of values from each parameter to those greater than 0 as the 69 effect of setting any demographic parameter to 0 in the model is obviously a steady fall to extinction (Fig  70  S1). Starting with rates of reproduction (Fig S1a), we see a large range of reproduction into the first stage 71 from the third, from 0 to roughly 100. However, we limited our preliminary analysis to < 10 as 87% 72 of sampled values fall within this range (Fig S1a, Insert). The rates of transition between stages are much 73 more evenly distributed from values >0 to 1. While not a completely even distribution across the range, 74 we take this as sufficient reason to study the full range of values, 0.1 < 12 < 0.9 and 0.1 < 2 < 0.9.

75
With these results in mind (Fig S1), ranges for these three parameters are provided in Table 1 along with  76 both parameter definitions and the ranges/values used for every model parameter. Transition rates from stage 1 to 2. Transition rates from stage 2 to 3. 82 83 S1.

-Simulation Design & Output 84 85
Potentially the most important insight garnered from available data in COMPADRE, we see 86 absolutely no correlation between any of the measured rates. There is no discernable relationship between 87 empirical rates of reproduction and maturation from either stage 1 to 2 or 2 to 3 (visualized in Fig S2a &  88 b). Also, we see no correlation between the stage transition rates (visualized in Fig S2c). The lack of any 89 relationships in the values of any rate prompted a full factorial investigation of the model's demographic 90 rates ( , 12 , 2 ; see Table 1 for value range), without any necessary covariation in parameter values. 91 In other words, all demographic parameter values were factorially tested against each other in a large 92 parameter sweep without the need to change any one parameter value in concert with another. Past work 93 also indicates interactivity between demographic rates and trophic interactions in driving model 94 dynamics 19 , so we considered trophic interactions in our simulation design by including rates of herbivory 95 ( & 2 ) into what becomes a five dimensional fully factorial parameter sweep. Herbivore attack rate 96 ranges were chosen heuristically (Table 1). Herbivore attack rates on either consumed plant stage (see Fig  97 1d-1i) vary factorially in the parameter sweep as herbivores can range from focusing their attack on either 98 stage to splitting their consumption between stages to varying degrees. & 2 vs transition rates between stages 2 & 3.

106
As a mode of sensitivity testing, we implemented the five parameter factorial sweep across 107 multiple values for both density dependent parameters and handling times. Density dependent parameters 108 were varied because density dependence mediates stage transitions and therefore ontogenetic dynamics 19 . 109 Handling time was chosen for its role in mediating the trophic connections interacting with plant 110 ontogeny. This created eight unique instances of the full five parameter factorial sweeps (see values in 111  Table 1), producing over 5.5 million unique simulations for analysis. 112 Each simulation outputs a number of initial conditions and post simulation factors detailing the 113 results of the simulation. Initial conditions include all relevant parameter values and initial time 114 dependent variable densities (i.e., initial population densities such as ( ) at time 0). Post simulation 115 factors include equilibria values, equilibria linear stability, eigenvalues, periodicity, oscillating 116 populations' peak and trough densities, and finally the effective handling time of the herbivore 117 population. The effective handling time of the herbivore population is measured as the denominator of the 118 consumptive interaction in the model. 119 120

122
Due to the large simulation dataset, we first produced an initial guided analysis using the Random 123 Forest based machine learning algorithm which can achieve high predictive power 46 and has shown 124 success in using permutation techniques to determine how much specific predictors contribute to that 125 predictive ability (e.g., mean accuracy decrease 47 ). In our case, our five main parameters (see Table 1 Simulation data was split into "training" and "validation" (sometimes called test) data subsets for 135 hold out cross validation. We first created or "trained" the random forest on the training data. In Random 136 Forests, this process produces a series of unique categorization and regression trees that "vote" on the 137 outcome (our simulation model's stability) based on the values of any particular inputs (our model 138 parameters). As a default during training, random forest parameter "mtry" was set at floor(sqrt(p)) for 139 categorization tasks (stable vs unstable) and floor(p/3) for regression tasks (max eigenvalue) where p=# 140 of features 48 . Instances where a different p produced better results are noted in the text. The parameter 141 "ntrees" (No. of trees) was varied from 300-600 with little to no effect on performance. 142 Once we have a trained random forest, we check its performance on the training data via the "out 143 of the box" (OOTB) error rate. Given a sufficiently low error rate, we can begin to investigate feature 144 importance. We measured the importance of individual features/parameters in our random forest with 145 Mean Accuracy Decrease, which measures the loss in predictive accuracy by excluding each feature. The 146 more the accuracy suffers, the more important the variable is for the successful classification/prediction. 147 For a more detailed description of the mechanisms behind creating and training random forests, please see Sufficiently high performing trained random forests were then used to predict simulation output 153 in our validation data subsets which our random forests had not yet been exposed to. We analyzed the 154 predictive power of categorization tasks by comparing their predicted output with data via the Area Under 155 Curve the Receiver Operating Characteristic curve (AUC) metric (pROC package). In our case, the AUC 156 metric measures how well the models are able to distinguish between stable and unstable results in the 157 validation data subset. It varies between 0 and 1 with 1 indicating better predictions. Regression tasks 158 were judged for accuracy using RMSE on maximum eigenvalue measurements between the predicted 159 eigenvalue output and the simulation data. All AUC and RMSE values are from validation data unless 160 otherwise specified. 161 162

S2.3 -Interpreting Feature Effects 163 164
Random Forest results can also be further interpreted with the H statistic which functions as a 165 measure of interactivity between features/predictors in driving prediction results 49 . This can be done for 166 each individual predictor as a measure of general interactivity with other predictors or be done with a 167 focus on particular predictors to study direct interactivity between specific predictor combinations. 168 Regardless, higher numbers indicate higher interaction strengths while lower numbers indicate less 169 interaction between predictors. While some have argued that the basic random forest measurements of 170 variable importance (e.g., mean accuracy decrease) "capture" the outcome of interactions between 171 features in predictions, these metrics are not designed to detect interactions per se 50 . Therefore, our use of 172 the H statistic helps us hone in on key interactions, a particularly useful outcome for any researcher 173 looking to implement our methods on a model with a larger parameter set. 174 While the H statistic provides inference regarding which features have strong interactive effects, 175 other methods are required to determine how features (interacting or otherwise) actually affect the 176 predicted outcomes. To do this we used two analytical techniques in the iml package in R 51 . First, we used 177 Partial Dependence plots (PD plots) to visualize the marginal effect of one or two features on the 178 predicted outcome of our random forests 52 . These PD plots can show whether the relationship between the 179 target and a feature is linear, monotonic or more complex. By focusing on two-feature partial dependence 180 plots, we can also see how these features interact in changing model predictions (e.g., Fig 2). Second, we 181 also used Individual Conditional Expectation (ICE) curves to uncover heterogeneous relationships by 182 showing individual instances of changing a feature's value at different permutations of the other 183 features 53 . In using these ICE plots, we found quick evidence of the context dependent relationship of 184 features and their effect on model predictions (Fig S3), again helping us determine where interactions 185 matter. Our random forests can produce a highly accurate level of predictive power. These levels of 202 predictive ability can induce questions of data leakage in producing and testing our models. We claim this 203 is not the case here and detail our reasons below. First, the two easiest sources of data leakage are 204 including the target variable as a feature in creating and training our models while the other is accidental 205 inclusion of test/validation data in our training data during model training. Neither of these occurred due 206 to simple due diligence in model creation. Our random forest code can be used to verify these claims. 207 Second, we have no "give away" features which are effectively tied to our target variables. Third, results 208 are not driven by particular outliers and are consistent across different subsets used as training and 209 test/validation datasets. Finally, models' variable importance changes in explanatory ways across 210 different subsets of data (e.g., different consumption allocations) and our results are supported by our 211 graphical analysis (Fig 2) which does not fall victim to data leakage issues. Therefore, we can have high 212 confidence that our random forest results do not reflect data leakage. 213 Despite the high levels of predictive power (e.g., results shown in Fig. 1), our random forests did 214 have limits on their immediate interpretability as noted by others (e.g. ref 54 ). Therefore, we use our 215 random forest models not as end points on their own, but as tools to direct our analysis across such a large 216 amount of simulation data. By determining feature importance, effect, and interactivity, we were able to 217 hone in specific subsections of the simulation model's parameter space and utilize graphical analysis to 218 expand our ecological understanding of our results. 219 220  using a 3-way tensor product smooth on 1 , 2 , and . a) Results from categorization task (stable as 1 293

S3 -Supplementary Analysis Results
and unstable as 0). Producing this GAM took substantially more time to calculate than the random forest, 294 but it does recreate our results. b) Results from regression task (maximum eigenvalue). "Hotter" colors 295 indicate "lower" values. In the categorization task, this indicates higher probabilities of as unstable 296 equilibrium (oscillations). In the regression task, conversely this indicates smaller eigenvalues. Overall, 297 we can see how the GAMs can recreate the patterns found in our random forest analysis once we have 298 developed the necessary parametric hypothesis from our random forest results. Although the 12 & 2 parameters and 12 & 2 ecological sub-functions we labeled "factors" 303 (Table 1) are related, their relationship is not 1:1. Specifically, 12 and 2 are parameters in the model 304 representing the per-capita germination/maturation rates of seeds/seedlings. The fixed values of these 305